Dodoma
Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili
Atuhurra, Jesse, Shindo, Hiroyuki, Kamigaito, Hidetaka, Watanabe, Taro
Many attempts have been made in multilingual NLP to ensure that pre-trained language models, such as mBERT or GPT2, improve and become applicable to low-resource languages. To achieve multilingualism for pre-trained language models (PLMs), we need techniques to create word embeddings that capture the linguistic characteristics of any language. Tokenization is one such technique because it allows words to be split based on characters or subwords, creating word embeddings that best represent the structure of the language. Creating such word embeddings is essential to applying PLMs to languages on which the model was not trained, enabling multilingual NLP. However, most PLMs use generic tokenization methods like BPE, WordPiece, or Unigram, which may not suit specific languages. We hypothesize that tokenization based on syllables within the input text, which we call syllable tokenization, should facilitate the development of syllable-aware language models. Syllable-aware language models make it possible to apply PLMs to languages that are rich in syllables, for instance, Swahili. Previous works introduced subword tokenization. Our work extends such efforts. Notably, we propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language. We conducted text-generation experiments with GPT2 to evaluate the effectiveness of the syllable tokenizer. Our results show that the proposed syllable tokenizer generates syllable embeddings that effectively represent the Swahili language.
Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons
Chen, Yuheng, Cao, Pengfei, Chen, Yubo, Liu, Kang, Zhao, Jun
Pre-trained language models (PLMs) contain vast amounts of factual knowledge, but how the knowledge is stored in the parameters remains unclear. This paper delves into the complex task of understanding how factual knowledge is stored in multilingual PLMs, and introduces the Architecture-adapted Multilingual Integrated Gradients method, which localizes knowledge neurons more precisely than current methods and is more universal across various architectures and languages. Moreover, we conduct an in-depth exploration of knowledge neurons, leading to the following two important discoveries: (1) The discovery of Language-Independent Knowledge Neurons, which store factual knowledge in a form that transcends language. We design cross-lingual knowledge editing experiments, demonstrating that the PLMs can accomplish this task based on language-independent neurons; (2) The discovery of Degenerate Knowledge Neurons, a novel type of neuron showing that different knowledge neurons can store the same fact. Its property of functional overlap endows the PLMs with a robust mastery of factual knowledge. We design fact-checking experiments, proving that the degenerate knowledge neurons can help the PLMs to detect wrong facts. Experiments corroborate these findings, shedding light on the mechanisms of factual knowledge storage in multilingual PLMs and contributing valuable insights to the field. The code is available at https://github.com/heng840/AMIG.
Why LLaMa Is A Big Deal
You might have heard about LLaMa, or maybe you haven't. In a nutshell, LLaMa is important because it allows you to run large language models (LLMs) like GPT-3 on commodity hardware. In many ways, this is a bit like Stable Diffusion, which similarly allowed normal folks to run image generation models on their own hardware with access to the underlying source code. We've discussed why Stable Diffusion matters and even talked about how it works. LLaMa, from Facebook/Meta research, is a collection of large transformer language models, ranging from 7 billion to 65 billion parameters, trained on publicly available datasets.
Zipline Launches Medical Supply Drone Deliveries in Tanzania
Last month in Rwanda, a young woman started bleeding after giving birth by C-section. Try as they might, her doctors couldn't stop it. They'd already transfused the two units of matching blood that they had on hand. They could have called the national blood bank in the capital of Kigali to request more, but ordering it and sending it the 25 miles over mountainous roads to the hospital would take up to four hours. The woman didn't have that kind of time.
U.K. aid body funding drone deliveries aimed at saving mothers, babies in Tanzania
LONDON - Drones delivering blood and medicine to rural areas of Tanzania could help to save the lives of many mothers and newborn babies in a country where one of the biggest causes of maternal deaths is blood loss during childbirth, the U.K. aid department said. The Department for International Development (DFID), which has given funding for the trial due to start early next year, said the drone deliveries could assist more than 50,000 births a year in the East African country. The drones will be able to carry up to 1 kg (2 pounds) of medical supplies and reduce delivery times to 19 minutes from the 110 minutes it takes on average by vehicle. "The U.K. is at the forefront of investing in cutting-edge technology to tackle the global challenges of today such as disease pandemics, medical emergencies and disaster responses," said Priti Patel, U.K.'s international development secretary. "This innovative, modern approach ensures we are achieving the best results for the world's poorest people and delivering value for money for British taxpayers," she said in a statement Thursday.
Drone-based blood deliveries in Tanzania to be funded by UK
The UK government is to fund a trial of drone-based deliveries of blood and other medical supplies in Tanzania. The goal is to radically reduce the amount of time it takes to send stock to health clinics in the African nation by road or other means. The scheme involves Zipline, a Silicon Valley start-up that began running a similar service in Rwanda in October. Experts praised that initiative but cautioned that "cargo drones" are still of limited use to humanitarian bodies. The Department for International Development (Dfid) has not said how much money will be invested in the Tanzanian effort or for how long.